Global Data Analysis and the Fragmentation Problem in Decision Tree Induction
Authors
Abstract
We investigate an inherent limitation of top-down decision tree induction, known as the fragmentation problem, in which the continuous partitioning of the instance space progressively reduces the statistical support of every partial (i.e., disjunctive) hypothesis. We show, both theoretically and empirically, how the fragmentation problem adversely affects predictive accuracy as variation r (a measure of concept difficulty) increases. Applying feature-construction techniques at every tree node, which we implement in a decision tree inducer, DALI, is proved to solve the fragmentation problem only partially. Our study illustrates how a more robust solution must also assess the value of each partial hypothesis by drawing on all available training data, an approach we name global data analysis, which decision tree induction alone is unable to accomplish. The value of global data analysis is evaluated by comparing modified versions of C4.5rules with C4.5trees and DALI on both artificial and real-world domains. Empirical results suggest the importance of combining feature construction and global data analysis to solve the fragmentation problem.
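As a rough illustration of the fragmentation effect described in the abstract (a toy sketch, not the paper's experiments or the DALI implementation), the following Python snippet partitions a small Boolean training set by successive features and reports the average number of examples left in each cell. The dataset, the function names `leaf_supports` and `mean_support`, and the fixed splitting order are all assumptions made for illustration only.

```python
import itertools

def leaf_supports(examples, depth):
    """Group examples by their first `depth` Boolean features.

    Each distinct prefix plays the role of one tree path (a partial
    hypothesis); the returned list holds the number of examples that
    reach each non-empty cell.
    """
    cells = {}
    for x in examples:
        key = x[:depth]
        cells[key] = cells.get(key, 0) + 1
    return sorted(cells.values())

def mean_support(examples, depth):
    """Average number of examples per cell after `depth` splits."""
    sizes = leaf_supports(examples, depth)
    return sum(sizes) / len(sizes)

# Hypothetical toy training set: all 2^6 = 64 Boolean vectors.
examples = list(itertools.product((0, 1), repeat=6))

for depth in range(5):
    # Support per cell halves with every additional split.
    print(depth, mean_support(examples, depth))
```

On this uniform toy set the support behind each path halves at every level, which is the sense in which deeper partial hypotheses rest on less evidence. A rule-based learner that scores each rule against the full training set would not be subject to this shrinkage.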
Similar resources
DIAGNOSIS OF BREAST LESIONS USING THE LOCAL CHAN-VESE MODEL, HIERARCHICAL FUZZY PARTITIONING AND FUZZY DECISION TREE INDUCTION
Breast cancer is one of the leading causes of death among women. Mammography remains today the best technology to detect breast cancer, early and efficiently, to distinguish between benign and malignant diseases. Several techniques in image processing and analysis have been developed to address this problem. In this paper, we propose a new solution to the problem of computer aided detection and...
A New Acceptance Sampling Design Using Bayesian Modeling and Backwards Induction
In acceptance sampling plans, the decision to accept or reject a specific batch is still a challenging problem. In order to provide a desired level of protection for customers as well as manufacturers, in this paper, a new acceptance sampling design is proposed to accept or reject a batch based on Bayesian modeling to update the distribution function of the percentage of nonconfor...
Performance Evaluation of Decision-Making Units Using Window Data Envelopment Analysis and Decision Tree
Efficiency is an issue of importance and interest to both managers of different organizations and customers who use the products and services of these organizations. The aim of this research is to study the efficiency of pharmaceutical companies accepted in the Stock Exchange Organization by using Window Data Envelopment Analysis (WDEA) and then, to provide some rules based on the decision tree...
Evaluation of liquefaction potential based on CPT results using C4.5 decision tree
The prediction of liquefaction potential of soil due to an earthquake is an essential task in Civil Engineering. The decision tree is a tree structure consisting of internal and terminal nodes which process the data to ultimately yield a classification. C4.5 is a known algorithm widely used to design decision trees. In this algorithm, a pruning process is carried out to solve the problem of the...
Comparing different stopping criteria for fuzzy decision tree induction through IDFID3
Fuzzy Decision Tree (FDT) classifiers combine decision trees with approximate reasoning offered by fuzzy representation to deal with language and measurement uncertainties. When an FDT induction algorithm utilizes stopping criteria for early stopping of the tree's growth, threshold values of stopping criteria will control the number of nodes. Finding a proper threshold value for a stopping crite...